
    A new pairwise kernel for biological network inference with support vector machines

    BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
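
    For concreteness, the sketch below shows how pairwise kernels of this kind can be evaluated from an ordinary node kernel. The closed forms used here are the commonly cited formulations of the metric learning pairwise kernel (MLPK) and the tensor product pairwise kernel (TPPK); they are assumed to match the paper's definitions, and the base kernel and node data are toy placeholders.

```python
import numpy as np

def mlpk(K, i, j, k, l):
    """Metric learning pairwise kernel between node pairs (i, j) and (k, l),
    built from a base node kernel matrix K (common formulation, assumed here)."""
    return (K[i, k] - K[i, l] - K[j, k] + K[j, l]) ** 2

def tppk(K, i, j, k, l):
    """Tensor product pairwise kernel, the usual baseline for indirect inference."""
    return K[i, k] * K[j, l] + K[i, l] * K[j, k]

# Toy base kernel over 5 nodes (e.g. an RBF kernel on genomic feature vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.1 * sq_dists)

pairs = [(0, 1), (2, 3), (1, 4)]   # candidate edges (node pairs)
G_mlpk = np.array([[mlpk(K, a, b, c, d) for (c, d) in pairs] for (a, b) in pairs])
print(G_mlpk)  # Gram matrix over pairs, usable by any SVM that accepts precomputed kernels
```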

    Modeling recursive RNA interference.

    An important application of the RNA interference (RNAi) pathway is its use as a small RNA-based regulatory system, commonly exploited to suppress expression of target genes to test their function in vivo. In several published experiments, RNAi has been used to inactivate components of the RNAi pathway itself, a procedure termed recursive RNAi in this report. The theoretical basis of recursive RNAi is unclear, since the procedure could potentially be self-defeating, and in practice the effectiveness of recursive RNAi in published experiments is highly variable. A mathematical model for recursive RNAi was developed and used to investigate the range of conditions under which the procedure should be effective. The model predicts that the effectiveness of recursive RNAi is strongly dependent on the efficacy of RNAi at knocking down target gene expression. This efficacy is known to vary greatly between different cell types, and comparison of the model predictions to published experimental data suggests that variation in RNAi efficacy may be the main cause of discrepancies between published recursive RNAi experiments in different organisms. The model suggests potential ways to optimize the effectiveness of recursive RNAi, both for screening of RNAi components and for improved temporal control of gene expression in switch-off/switch-on experiments.
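
    The abstract does not reproduce the model's equations, so the toy ODE below only illustrates the self-referential structure it describes: an RNAi component whose extra degradation depends on the activity of the very pathway it sustains. All parameter names and functional forms here are hypothetical, not the published model.

```python
# Toy, illustrative ODE only -- NOT the published model; parameters are hypothetical.
from scipy.integrate import solve_ivp

alpha, delta = 1.0, 0.1   # synthesis and basal turnover of the RNAi component R
K_half = 2.0              # half-saturation of pathway activity

def recursive_rnai(t, y, eta):
    R = y[0]
    activity = R / (K_half + R)                   # silencing capacity depends on R itself
    dR = alpha - delta * R - eta * activity * R   # extra degradation via RNAi against R
    return [dR]

for eta in (0.05, 0.5, 5.0):                      # low vs. high RNAi efficacy
    sol = solve_ivp(recursive_rnai, (0, 200), [alpha / delta], args=(eta,), max_step=1.0)
    print(f"efficacy {eta}: steady-state R ~ {sol.y[0, -1]:.2f}")
```

    Even this crude sketch reproduces the qualitative point of the abstract: when the knockdown efficacy is low, the component stays near its unperturbed level, while high efficacy drives it down despite the pathway depending on it.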

    Classification of microarray data using gene networks

    BACKGROUND: Microarrays have become extremely useful for analysing genetic phenomena, but establishing a relation between microarray analysis results (typically a list of genes) and their biological significance is often difficult. Currently, the standard approach is to map the results a posteriori onto gene networks in order to elucidate the functions perturbed at the level of pathways. However, integrating a priori knowledge of the gene networks could help in the statistical analysis of gene expression data and in their biological interpretation. RESULTS: We propose a method to integrate a priori knowledge of a gene network into the analysis of gene expression data. The approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation of the high-frequency components of the expression profiles with respect to the topology of the graph. We show how to derive unsupervised and supervised classification algorithms for expression profiles, resulting in classifiers with biological relevance. We illustrate the method with the analysis of a set of expression profiles from irradiated and non-irradiated yeast strains. CONCLUSION: Including a priori knowledge of a gene network in the analysis of gene expression data leads to good classification performance and improved interpretability of the results.
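
    A minimal sketch of the kind of graph-spectral smoothing the abstract describes, assuming a Laplacian eigenbasis and an exponential attenuation of high-frequency components; the toy network and the attenuation function are placeholders rather than the paper's exact choices.

```python
# Illustrative graph-spectral smoothing of an expression profile (assumptions noted above).
import numpy as np

A = np.array([[0, 1, 1, 0],      # toy gene network adjacency (4 genes)
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A                # graph Laplacian
lam, U = np.linalg.eigh(L)               # eigenvalues act as graph "frequencies"

beta = 1.0
attenuation = np.exp(-beta * lam)        # damp high-frequency (large-eigenvalue) modes

x = np.array([2.1, 1.8, -0.5, 3.0])      # one expression profile over the 4 genes
x_smooth = U @ (attenuation * (U.T @ x)) # filter in the graph Fourier domain
print(x_smooth)                          # smoothed profile, then fed to a classifier
```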

    In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

    Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genomes in each group in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and the discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
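
    As an illustration of the statistical CGP idea, the sketch below ranks genes by how strongly their presence/absence pattern across genomes differs between two phenotypic groups. The scoring statistic (Fisher's exact test) and the toy data are stand-ins; the paper's exact statistic may differ.

```python
# Hedged illustration of statistical candidate gene prioritisation from phylogenetic profiles.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n_genomes, n_genes = 40, 6
profiles = rng.integers(0, 2, size=(n_genomes, n_genes))  # 1 = gene family present
phenotype = np.array([1] * 20 + [0] * 20)                  # trait-positive vs. trait-negative genomes

scores = []
for g in range(n_genes):
    present = profiles[:, g]
    table = [[np.sum((present == 1) & (phenotype == 1)),
              np.sum((present == 1) & (phenotype == 0))],
             [np.sum((present == 0) & (phenotype == 1)),
              np.sum((present == 0) & (phenotype == 0))]]
    _, p = fisher_exact(table)        # smaller p = stronger phenotype association
    scores.append(p)

ranking = np.argsort(scores)          # most phenotype-associated genes first
print(ranking)
```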

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.
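
    As one concrete instance of the settings covered by such a framework, the snippet below fits a simple multi-label classifier by wrapping an independent model per target. This is only the most basic MTP baseline for illustration, not a method from the survey itself, and the data are synthetic.

```python
# Minimal multi-target (multi-label) prediction example with scikit-learn.
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))                        # 100 instances, 8 features
Y = (X @ rng.normal(size=(8, 3)) > 0).astype(int)    # 3 binary target variables

model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(model.predict(X[:5]))                          # one prediction per target, per instance
```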

    Application of kernel functions for accurate similarity search in large chemical databases

    Background: Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to do the query. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to the high computational complexity and the difficulties in indexing similarity search for large databases. Results: To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support the new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller indexing size and faster query processing time compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Conclusions: Efficient similarity query processing for large chemical databases is challenging, since we need to balance running time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.
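
    The sketch below is not G-hash itself, but it illustrates the general idea the abstract describes: map each chemical graph to hashed local descriptors, keep the descriptor counts in a hash table, and rank database entries by a kernel-like similarity for nearest-neighbor search. The descriptor definition and similarity function are simplified stand-ins.

```python
# Illustrative only -- not the G-hash algorithm; a toy hash-table similarity search.
from collections import Counter

def node_descriptors(graph):
    """graph: dict node -> (label, list of neighbor nodes). A descriptor is the node
    label plus the sorted labels of its neighbors (a crude 1-hop environment)."""
    descs = []
    for node, (label, nbrs) in graph.items():
        env = tuple(sorted(graph[n][0] for n in nbrs))
        descs.append((label, env))
    return Counter(descs)

def similarity(c1, c2):
    """Histogram-intersection style similarity over hashed descriptor counts."""
    return sum(min(c1[k], c2[k]) for k in c1.keys() & c2.keys())

# Two toy molecular graphs (labels are element symbols, hydrogens omitted).
ethanol = {0: ("C", [1]), 1: ("C", [0, 2]), 2: ("O", [1])}
methanol = {0: ("C", [1]), 1: ("O", [0])}

index = {"ethanol": node_descriptors(ethanol), "methanol": node_descriptors(methanol)}
query = node_descriptors(methanol)
print(sorted(index, key=lambda name: -similarity(index[name], query)))  # nearest first
```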

    The impact of cyclin-dependent kinase 5 depletion on poly(ADP-ribose) polymerase activity and responses to radiation

    Cyclin-dependent kinase 5 (Cdk5) has been identified as a determinant of sensitivity to poly(ADP-ribose) polymerase (PARP) inhibitors. Here, the consequences of its depletion on cell survival, PARP activity, the recruitment of base excision repair (BER) proteins to DNA damage sites, and overall DNA single-strand break (SSB) repair were investigated using isogenic HeLa stably depleted (KD) and Control cell lines. Synthetic lethality achieved by disrupting PARP activity in Cdk5-deficient cells was confirmed, and the Cdk5KD cells were also found to be sensitive to the killing effects of ionizing radiation (IR) but not to methyl methanesulfonate or neocarzinostatin. The recruitment profiles of GFP-PARP-1 and XRCC1-YFP to DNA damage sites in micro-irradiated Cdk5KD cells were slower and reached lower maximum values, while the profile of GFP-PCNA recruitment was faster and attained higher maximum values compared to Control cells. Higher basal, IR-induced, and hydrogen peroxide-induced polymer levels were observed in Cdk5KD compared to Control cells. Recruitment of a GFP-PARP-1 construct in which serines 782, 785, and 786, potential Cdk5 phosphorylation targets, were mutated to alanines was also reduced in micro-irradiated Control cells. We hypothesize that Cdk5-dependent PARP-1 phosphorylation on one or more of these serines results in an attenuation of its ribosylating activity, facilitating persistence at DNA damage sites. Despite these deficiencies, Cdk5KD cells are able to effectively repair SSBs, probably via the long-patch BER pathway, suggesting that the enhanced radiation sensitivity of Cdk5KD cells is due to a role of Cdk5 in other pathways or to the altered polymer levels.

    Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics

    Background: High-throughput peptide and protein identification technologies have benefited tremendously from strategies based on tandem mass spectrometry (MS/MS) in combination with database searching algorithms. A major problem with existing methods lies within the significant number of false positive and false negative annotations. So far, standard algorithms for protein identification do not use the information gained from separation processes usually involved in peptide analysis, such as retention time information, which is readily available from chromatographic separation of the sample. Identification can thus be improved by comparing measured retention times to predicted retention times. Current prediction models are derived from a set of measured test analytes, but they usually require large amounts of training data. Results: We introduce a new kernel function which can be applied in combination with support vector machines to a wide range of computational proteomics problems. We show the performance of this new approach by applying it to the prediction of peptide adsorption/elution behavior in strong anion-exchange solid-phase extraction (SAX-SPE) and ion-pair reversed-phase high-performance liquid chromatography (IP-RP-HPLC). Furthermore, the predicted retention times are used to improve spectrum identifications by a p-value-based filtering approach. The approach was tested on a number of different datasets and shows excellent performance while requiring only very small training sets (about 40 peptides instead of thousands). Using the retention time predictor in our retention time filter improves the fraction of correctly identified peptide mass spectra significantly. Conclusion: The proposed kernel function is well suited for the prediction of chromatographic separation in computational proteomics and requires only a limited amount of training data. The performance of this new method is demonstrated by applying it to peptide retention time prediction in IP-RP-HPLC and prediction of peptide sample fractionation in SAX-SPE. Finally, we incorporate the predicted chromatographic behavior in a p-value-based filter to improve peptide identifications based on liquid chromatography-tandem mass spectrometry.
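
    A hedged sketch of the overall workflow: train a support vector regressor on peptide features, predict retention times, and reject identifications whose observed retention time deviates too far from the prediction. The amino-acid-composition features, RBF kernel, toy data, and fixed tolerance are stand-ins for the paper's specialized kernel and p-value-based filter.

```python
# Illustrative retention-time prediction and filtering (assumptions noted above).
import numpy as np
from sklearn.svm import SVR

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(peptide):
    """Length-normalized amino-acid composition as a simple feature vector."""
    counts = np.array([peptide.count(a) for a in AA], float)
    return counts / max(len(peptide), 1)

train_peps = ["ACDK", "LLGMK", "EEDR", "FVVWK", "GGSTK"]   # toy training peptides
train_rt = np.array([5.2, 14.1, 4.0, 18.5, 6.7])           # toy retention times (min)

model = SVR(kernel="rbf", C=10.0).fit([composition(p) for p in train_peps], train_rt)

observed_rt, peptide = 15.0, "LLAMK"                        # one candidate identification
predicted_rt = model.predict([composition(peptide)])[0]
tolerance = 3.0                                             # acceptance window (min)
keep = abs(observed_rt - predicted_rt) <= tolerance         # crude stand-in for the p-value filter
print(predicted_rt, keep)
```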